library(ISLR2)
library(car)
library(mgcv)
library(rgl)
library(splines)
library(pbapply)
In the past two labs we have seen several methods that allow us to fit a smooth regression function \[ y_i=f(x_i)+\varepsilon_i, \quad i=1,\ldots,N \] with \(x_i\in \mathbb{R}\). In today’s lab we move forward and look at a way to perform multivariate nonparametric regression, that is, the case \(x_i\in \mathbb{R}^p\) with \(p>1\). We will do so by means of an additive approach: Generalized Additive Models (GAMs). Let us go back to our well-known Prestige dataset, but let us consider the education variable as well.
with(Prestige, scatterplotMatrix(data.frame(prestige, education, income)))
The relationship between prestige and education seems fairly linear. On the contrary, as we have extensively experienced, prestige and income are quite nonlinearly related.
model_lm=lm(prestige ~ education + income, data=Prestige)
summary(model_lm)
##
## Call:
## lm(formula = prestige ~ education + income, data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.4040 -5.3308 0.0154 4.9803 17.6889
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.8477787 3.2189771 -2.127 0.0359 *
## education 4.1374444 0.3489120 11.858 < 2e-16 ***
## income 0.0013612 0.0002242 6.071 2.36e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.81 on 99 degrees of freedom
## Multiple R-squared: 0.798, Adjusted R-squared: 0.7939
## F-statistic: 195.6 on 2 and 99 DF, p-value: < 2.2e-16
Building prediction surfaces is as simple as building prediction curves was in the univariate case:
education.grid=seq(range(Prestige$education)[1],range(Prestige$education)[2],length.out = 100)
income.grid=seq(range(Prestige$income)[1],range(Prestige$income)[2],length.out = 100)
grid=expand.grid(education.grid,income.grid)
names(grid)=c('education','income')
pred=predict(model_lm,newdata=grid)
persp3d(education.grid,income.grid,pred,col='blue',border="black",lwd=0.3)
with(Prestige, points3d(education,income,prestige,col='black',size=5))
model_lm_interaction <- lm(prestige ~ education + income + education:income, data=Prestige)
# model_lm_interaction <- lm(prestige ~ education * income, data=Prestige) # alternatively
summary(model_lm_interaction)
##
## Call:
## lm(formula = prestige ~ education + income + education:income,
## data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.0026 -5.1918 0.3768 4.5915 19.9312
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.207e+01 6.581e+00 -3.353 0.001136 **
## education 5.373e+00 5.797e-01 9.270 4.65e-15 ***
## income 3.944e-03 1.007e-03 3.918 0.000165 ***
## education:income -1.961e-04 7.460e-05 -2.628 0.009958 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.587 on 98 degrees of freedom
## Multiple R-squared: 0.8113, Adjusted R-squared: 0.8055
## F-statistic: 140.5 on 3 and 98 DF, p-value: < 2.2e-16
We manage to capture a bit of the nonlinearity with the interaction term… Can we do better?
pred_interaction=predict(model_lm_interaction,newdata=grid)
persp3d(education.grid,income.grid,pred_interaction,col='grey30')
with(Prestige,points3d(education,income,prestige,col='black',size=5))
We certainly can, by employing Generalized Additive Models! The main reference (also surprisingly accessible and straightforward to use) is Generalized Additive Models: An Introduction with R by Simon N. Wood. The package mgcv accompanies the book; it is a very well structured and actively maintained package that provides smoothness estimation for a variety of models: we will only scratch the surface today! Recall the analytical expression of a GAM:
\[ y_i = \beta_0 + \sum_{j=1}^{p} f_j(x_{ij}) + \varepsilon_i, \quad i=1,\ldots,N. \]
It is called an additive model because we calculate a separate \(f_{j}\) for each \(X_{j}\), and then add together all of their contributions.
We notice the main features of a GAM: each predictor is allowed to act on the response in a nonlinear way, while the additive structure preserves the term-by-term interpretability of a linear model.
Operatively, gam() is the function of the mgcv package for fitting these types of models. It works very much like the glm function for generalized linear models, except that gam() allows us to specify diverse smoothing options for the model terms.
model_gam=gam(prestige ~ s(education,bs='cr') + s(income,bs='cr'),data = Prestige)
What s() does is essentially build a “smooth” term for each of the covariates we put in. The behavior is very similar to the smooth.spline function we saw last time: there is no need to set the number of knots, nor (as in the gam package) the equivalent degrees of freedom; mgcv will take care of everything for you.
With "bs" we indicate the penalized smoothing basis to use.
"cr" provides a cubic regression spline basis defined by a modest-sized set of knots spread evenly through the covariate values. The fitting is based on penalized likelihood, where the penalty is the conventional integrated squared second derivative of each smooth.
For more info, run ?smooth.terms
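For instance (an aside of ours, not part of the original lab): the default basis in mgcv is a low-rank thin plate regression spline ("tp"), and the k argument bounds the dimension of the basis, hence the maximum edf a term can spend.
model_gam_k = gam(prestige ~ s(education, k = 5) + s(income, bs = 'cr', k = 10),
                  data = Prestige)  # education uses the default 'tp' basis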
Let us look at the model’s summary:
summary(model_gam)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## prestige ~ s(education, bs = "cr") + s(income, bs = "cr")
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.8333 0.6884 68.03 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(education) 3.154 3.913 39.22 <2e-16 ***
## s(income) 2.996 3.604 16.68 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.837 Deviance explained = 84.7%
## GCV = 51.984 Scale est. = 48.34 n = 102
If you are familiar with the output of a glm object, the one above should not surprise you much. The only slightly exotic part may be the “approximate significance of smooth terms”: the effective degrees of freedom (edf) is a summary statistic of a GAM that reflects the degree of nonlinearity of a curve.
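As a quick check of our own: an edf close to 1 indicates an essentially linear effect, and the per-term edf values can be extracted directly from the summary object.
summary(model_gam)$s.table[, "edf"]  # one value per smooth term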
Notice that in this case the adjusted R squared is better, even without the interaction term. Let us compare the residuals with the quantiles of a normal distribution:
hist(model_gam$residuals)
qqnorm(model_gam$residuals)
qqline(model_gam$residuals)  # reference line for easier visual assessment
Note that we could also compare the depths of these residuals with the depths of a sample from a normal distribution with the same mean and variance:
mu.res <- mean(model_gam$residuals)
s.res <- sd(model_gam$residuals)
x <- as.matrix(model_gam$residuals)
y <- as.matrix( rnorm(length(model_gam$residuals), mu.res, s.res) ) # s.res is already a standard deviation, so no sqrt is needed
DepthProc::ddPlot(x, y,
depth_params = list(method='Tukey'))
## DDPlot
##
## Depth Metohod:
## Tukey
Normality test:
shapiro.test(model_gam$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_gam$residuals
## W = 0.98887, p-value = 0.5603
Not bad at all! Notice that mgcv works with smoothing bases; nevertheless, the beauty of GAMs is that we can use every type of univariate smoother as a building block for fitting an additive model. We may, for example, wish to employ natural cubic splines to provide the nonparametric components in our GAMs. In this case, we do not actually need anything new beyond the good old lm function and the basis matrix generator contained in the splines package.
model_gam_ns <-
lm(prestige ~ ns(education, df = 3) + ns(income, df = 3), data = Prestige)
In most situations the differences between the GAMs obtained using smoothing splines and natural splines are small. Let us look at the residuals scatterplot for the two models:
plot(model_gam_ns$residuals,model_gam$residuals)
cor(model_gam_ns$residuals,model_gam$residuals)
## [1] 0.988401
Pretty much the same fit!
GAMs make it easy to interpret the effect of each regressor on the response variable. Again, it works very much like the analysis of a linear model: we look at the contribution of a single predictor holding all the others fixed. This is best accomplished graphically, with the plot method conveniently provided by the mgcv package.
plot(model_gam)
The same can be achieved for the natural splines model, with a tiny workaround (programming tip!)
gam::plot.Gam(model_gam_ns, se=TRUE)
Again, we see that there is basically no difference between model_gam and model_gam_ns.
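As an extra check of our own (both objects have a logLik method, so information criteria apply to each; comparing across model classes should be taken with a grain of salt), AIC tells the same story:
AIC(model_gam, model_gam_ns)  # a data.frame with the (effective) df and AIC of each model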
We have already noticed at the very beginning that the relationship between prestige and education seems fairly linear. We now test whether a smooth function is needed for education, or whether we can be satisfied with a linear contribution. The reduced model (which is de facto a semiparametric regression) reads:
model_gam_reduced=gam(prestige ~ education + s(income,bs='cr'),data = Prestige)
summary(model_gam_reduced)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## prestige ~ education + s(income, bs = "cr")
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.1342 3.7355 1.107 0.271
## education 3.9764 0.3415 11.644 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(income) 3.353 4.012 15.35 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.825 Deviance explained = 83.3%
## GCV = 54.563 Scale est. = 51.7 n = 102
The linear contribution of education is highly significant, but which model is better? Let us run a formal test, as we would for nested model selection with lm: the anova function serves the purpose.
anova(model_gam_reduced,model_gam, test = "F")
Of course, we could have coded the (parametric) test ourselves, using the test statistic \[ F = \frac{(SS_1 - SS_2)/(df_1 - df_2)}{SS_2/df_2} \] where the subscript 1 refers to the reduced model and 2 to the full model.
N <- nrow(Prestige)
RSS_reduced <- deviance(model_gam_reduced)
RSS_full <- deviance(model_gam)
df_full <- N-sum(summary(model_gam)$s.table[,2])-1 # -1 for the intercept
df_reduced <- N-sum(summary(model_gam_reduced)$s.table[,2])-2 # -2 for the intercept and the linear contribution of education
df_difference <- df_reduced-df_full
F_value_manual <- ((RSS_reduced-RSS_full)/df_difference)/(RSS_full/summary(model_gam)$residual.df)
pf(F_value_manual, df1 = df_difference,df2 = df_full,lower.tail = FALSE)
## [1] 0.02791939
It seems that a nonlinear component for education is needed after all.
Important: if our residuals were not normal, we would have needed to perform a nonparametric test instead; but you would have known how to do it, right? One possible approach is sketched below.
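Here is a minimal sketch of one such alternative: a permutation test built on the residuals of the reduced model (a Freedman–Lane-style scheme; the choice of scheme and the value of B_perm are ours, not prescribed by the lab), reusing the F statistic computed above.
B_perm <- 1000
T_perm <- numeric(B_perm)
fitted_reduced <- fitted(model_gam_reduced)
res_reduced <- residuals(model_gam_reduced)
set.seed(1)
for (b in 1:B_perm) {
  Prestige_perm <- Prestige
  # generate a response that satisfies H0 by permuting the reduced-model residuals
  Prestige_perm$prestige <- fitted_reduced + sample(res_reduced)
  full_b    <- gam(prestige ~ s(education, bs = 'cr') + s(income, bs = 'cr'), data = Prestige_perm)
  reduced_b <- gam(prestige ~ education + s(income, bs = 'cr'), data = Prestige_perm)
  T_perm[b] <- ((deviance(reduced_b) - deviance(full_b)) / df_difference) /
    (deviance(full_b) / summary(full_b)$residual.df)
}
sum(T_perm >= F_value_manual) / B_perm  # permutation p-value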
Performing prediction for GAMs is as simple as it always has been:
pred_gam=predict(model_gam,newdata=grid)
persp3d(education.grid,income.grid,pred_gam,col='grey30')
with(Prestige,points3d(education,income,prestige,col='black',size=5))
We can include interactions in a GAM as well: we could either employ a simpler approach (the one we use below) by defining an interaction term and smoothing it, or we could employ more sophisticated bidimensional splines, such as thin plate splines (Wood, 2003). In general, bivariate splines are preferable when the two dimensions are tightly related (e.g., spatial coordinates).
model_gam_inter = gam(prestige ~ s(education, bs = 'cr') +
s(income, bs ='cr') +
s(I(income * education), bs = 'cr'),
data = Prestige)
pred_inter = predict(model_gam_inter, newdata = grid)
# I(income * education) is computed from the education and income columns of grid,
# so no extra interaction column is needed
persp3d(education.grid, income.grid, pred_inter, col = 'grey30')
with(Prestige,
points3d(education, income, prestige, col = 'black', size = 5))
And lastly, a GAM with thin-plate splines.
model_gam_tp = gam(prestige ~ s(education, income, bs="tp", m=2), # m: order of the derivative penalty
data = Prestige)
pred_tp = predict(model_gam_tp,
newdata = data.frame(grid))
persp3d(education.grid, income.grid, pred_tp, col = 'grey30')
with(Prestige,
points3d(education, income, prestige, col = 'black', size = 5))
plot(model_gam_tp)
In this brief section we present the main proposal of A simple bootstrap method for constructing nonparametric confidence bands for functions by Hall and Horowitz.
For simplicity we focus on the univariate case, where we model \[ y_i=f(x_i)+\varepsilon_i, \quad i=1,\ldots,N \] and for which, of course, we have an estimate \(\hat{f}(x)\): a kernel regression (local estimate), a step-function regression, a smoothing natural spline, et cetera.
Algorithm 1: Nonparametric Inference for Nonparametric Regression
1. Compute the estimate \(\hat{f}(x)\) and the estimate \(\hat{\sigma}^2\) of the residual variance.
2. Compute the residuals \[ \tilde{\epsilon}_i = y_i - \hat{f}(x_i), \] their center \[ \bar{\epsilon} = N^{-1} \sum_i \tilde{\epsilon}_i, \] the centered residuals \[ \hat{\epsilon}_i = \tilde{\epsilon}_i - \bar{\epsilon}, \] and the residual-based estimator of \(Var[\epsilon]\): \[ \hat{\sigma}^2 = N^{-1} \sum_i \hat{\epsilon}_i^2. \]
3. Construct the bootstrap samples: set \[ y_i^* = \hat{f}(x_i) + \epsilon_i^* \] where the \(\epsilon_i^*\) are obtained by sampling with replacement (resampling) from \(\hat{\epsilon}_1, \ldots, \hat{\epsilon}_N\).
4. Compute the bootstrap versions of \(\hat{f}\), \(\hat{\sigma}^2\) and of \(\mathcal{B}(\alpha)\) (the confidence band): \[ \mathcal{B}^*(\alpha) = \left\{ (x, y): \hat{f}^*(x) - s(\mathcal{X}) \hat{\sigma}^* z_{1-\frac{\alpha}{2}} \leq y \leq \hat{f}^*(x) + s(\mathcal{X}) \hat{\sigma}^* z_{1-\frac{\alpha}{2}} \right\} \] where \(s(\mathcal{X})\) is a function of the design matrix that ensures that the variance estimator is consistent. Note that the use of the normal quantile \(z_{1-\frac{\alpha}{2}}\) implies a normal approximation.
5. Estimate the coverage error. For every bootstrap iteration we have a confidence band, so we measure how often the points fall within the bands: \[ \hat{\pi}(x, \alpha) = \mathbb{P}[(x, \hat{f}(x)) \in \mathcal{B}^*(\alpha) \mid \mathcal{X}] \simeq B^{-1} \sum_{b=1}^B \mathbb{I}_{\{(x, \hat{f}(x)) \in \mathcal{B}^*_b(\alpha) \}}. \]
6. Construct the final confidence band, utilising the estimate in step 5 (cf. the paper).
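To fix ideas, here is a minimal sketch of steps 1–5 on the Prestige data (an illustration of ours: we take a natural-spline estimate of \(f\) and set \(s(\mathcal{X}) \equiv 1\), a simplification of the paper's construction; all names are ours).
df_hh  <- data.frame(x = Prestige$income, y = Prestige$prestige)
fit_hh <- lm(y ~ ns(x, df = 4), data = df_hh)                  # step 1: f-hat
eps    <- residuals(fit_hh) - mean(residuals(fit_hh))          # step 2: centered residuals
x_grid_hh <- seq(min(df_hh$x), max(df_hh$x), length.out = 200)
alpha <- 0.05; B_hh <- 500
set.seed(1)
bands <- replicate(B_hh, {
  df_star  <- data.frame(x = df_hh$x,
                         y = fitted(fit_hh) + sample(eps, replace = TRUE))  # step 3
  fit_star <- lm(y ~ ns(x, df = 4), data = df_star)
  f_star   <- predict(fit_star, newdata = data.frame(x = x_grid_hh))
  sigma_star <- sqrt(mean(residuals(fit_star)^2))
  rbind(f_star - sigma_star * qnorm(1 - alpha/2),               # step 4: B*(alpha)
        f_star + sigma_star * qnorm(1 - alpha/2))
})
# step 5: estimated coverage of (x, f-hat(x)) at each grid point
f_hat_grid <- predict(fit_hh, newdata = data.frame(x = x_grid_hh))
pi_hat <- rowMeans(bands[1, , ] <= f_hat_grid & f_hat_grid <= bands[2, , ])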
Dr. Simoni, a data scientist, just had his second kid (a boy, 58 cm tall) and he is interested in knowing how tall his child will be at 25 years of age, given his height at birth. To do so, he has collected the height at birth of 100 males, alongside the height they reached when they were 25. He has collected these data in the file boyheight.rda: height.25 is the height [cm] at 25 years of age, while height.b is the length of the newborn [cm]. Assume that \[ \text{height.25} = f(\text{height.b}) + \varepsilon. \]
Build a degree 1 regression spline model, with breaks at the 25th and 75th percentiles of height.b, to predict the height at 25 from the height at birth. Provide a plot of the regression line, compute the pointwise prediction (rounded to the second decimal digit) for the height at 25 of Dr. Simoni's newborn child, and calculate, using a bootstrap approach on the residuals, the bias, variance and mean squared error (MSE) of such a prediction.
Dr. Simoni is not particularly satisfied with the predictions of such a simple model, and would like you to do more. So build a prediction model for the height at 25 based on a smoothing spline of order 4 (select the lambda parameter via Leave-One-Out CV). Report the optimal lambda value (2 decimal digits), provide a plot of the regression line alongside the pointwise prediction of the height at 25 of Dr. Simoni's kid, and calculate, using a bootstrap approach on the residuals, the bias, variance and MSE of such a prediction (fix the lambda value to the one obtained via Leave-One-Out CV).
load(here::here("Block III - Nonparametric regression/data/boyheight.rda"))
plot(x=height.b, y=height.25)
# Breaks at the 25th and 75th percentile of *height.b*
knots_height <- quantile(height.b, probs = c(.25, .75))
# Build a degree 1 regression spline model
fit_spline <- lm(height.25 ~ bs(height.b, knots = knots_height, degree = 1))
new_data_seq <- seq(min(height.b),max(height.b),length.out=100)
# Predict the height at 25 from the height at birth
preds=predict(fit_spline, newdata = list(height.b=new_data_seq),se=T)
se.bands = cbind(preds$fit + 2*preds$se.fit, preds$fit - 2*preds$se.fit)
# Plot of the regression line
plot(y = height.25, x = height.b, cex = .5, col = "darkgrey")
lines(new_data_seq, preds$fit, lwd = 2, col = "blue")
matlines(new_data_seq, se.bands, lwd = 1, col = "blue", lty = 3)
# Pointwise prediction
simoni_height <- 58  # length at birth [cm] of Dr. Simoni's newborn
(pointwise_pred <- predict(fit_spline, newdata = list(height.b = simoni_height)))
## 1
## 191.4279
# Bias, variance and Mean Squared Error (MSE) of pointwise_pred
wrapper_f <- function(){
  height.25.boot = fitted + sample(residuals, n, replace = TRUE)  # bootstrap sample on the residuals
  new_model = lm(height.25.boot ~ bs(height.b, knots = knots_height, degree = 1))
  predict(new_model, newdata = list(height.b = simoni_height))
}
B <- 200
fitted=fit_spline$fitted.values
residuals=fit_spline$residuals
n <- length(height.25)
set.seed(100)
boot_d <- pbreplicate(n = B,expr = wrapper_f(),simplify = "vector")
(variance_pred <- var(boot_d))
## [1] 1.158379
(bias_pred <- pointwise_pred-mean(boot_d))
## 1
## 0.04960756
(MSE_pred <- variance_pred +bias_pred^2)
## 1
## 1.16084
# smoothing spline of order 4 (cv=TRUE for LOOCV)
fit_smooth <- smooth.spline(x = height.b, y=height.25,cv = TRUE)
# optimal lambda value
round(fit_smooth$lambda,2)
## [1] 0.01
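# As an aside (a brute-force sketch of ours, not required by the exercise):
# the LOOCV criterion that cv = TRUE minimises can be approximated explicitly
# over a grid of lambda values.
lambda_grid <- 10^seq(-4, 0, length.out = 20)
cv_err <- sapply(lambda_grid, function(l) {
  mean(sapply(seq_len(n), function(i) {
    fit_i <- smooth.spline(height.b[-i], height.25[-i], lambda = l)
    (height.25[i] - predict(fit_i, height.b[i])$y)^2   # leave-one-out squared error
  }))
})
lambda_grid[which.min(cv_err)]  # compare with fit_smooth$lambda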
# plot of the regression line
plot(y = height.25, x = height.b, cex = .5, col = "darkgrey")
lines(fit_smooth$x, fit_smooth$y, col="blue", lwd=2)
fitted_smooth <- predict(fit_smooth, height.b)$y
fitted_res <- height.25-fitted_smooth
pred_smooth_simoni <- predict(fit_smooth, simoni_height)
# pointwise prediction of the height at 25 of Dr. Simoni’s kid
pred_smooth_simoni$y
## [1] 191.9669
boot_smooth=numeric(B)
set.seed(100)
for (b in 1:B) {  # avoid naming the loop variable 'sample', which shadows base::sample
  height.25.boot = fitted_smooth + sample(fitted_res, n, replace = TRUE)
  new_model = smooth.spline(y = height.25.boot, x = height.b, lambda = fit_smooth$lambda)
  boot_smooth[b] = predict(new_model, simoni_height)$y
}
(variance_pred_smooth <- var(boot_smooth))
## [1] 0.9159435
(bias_pred_smooth <- pred_smooth_simoni$y-mean(boot_smooth))
## [1] -0.3520232
(MSE_pred_smooth <- variance_pred_smooth +bias_pred_smooth^2)
## [1] 1.039864
Dr. Bisacciny, Ph.D., is becoming increasingly worried about the fact that the bees he decided to place near a corn field close to the Italian city of Milan tend to die way earlier than the ones placed close to the small village of Jovençan, in Aosta Valley. He suspects that additional factors influencing the survival time of a bee are its productivity and its weight. For this reason, he decides to run an experiment, in which he selects 10,000 bees, placing 5,000 of them in Aosta Valley and 5,000 in Milan. Dr. Bisacciny runs the experiment for 50 days, during which he annotates when a bee passes away. After the end of the experiment, he starts to analyse the data. In the file ex02.rda you can find a dataframe with information about the status of the bee (1 if alive, 2 if dead), the survival time of the bee (surv.time), its weight (weight) and productivity (prod), as well as its location (Jovençan or Milan). We model these data as i.i.d. realizations of a four-dimensional random variable:
load(here::here("Block III - Nonparametric regression/data/ex02.rda"))
head(bee)
fit_gam <- gam(log(surv.time)~s(weight,bs = "cr")+s(prod,bs = "cr")+location,data = bee)
Estimated model:
\[\begin{equation} \log(\text{surv.time}) = \beta_{0} + f_1(\text{weight}) + f_2(\text{prod}) + \beta_1 \, \text{location}_{\text{Milan}} + \epsilon \end{equation}\]
table_fit_gam <- summary(fit_gam)
# adjusted R2
(r_2_squared <- table_fit_gam$r.sq)
## [1] 0.4569872
# p-values of the tests (parametric coefficients)
table_fit_gam$p.pv
## (Intercept) locationMilan
## 0 0
# p-values of the tests (smooth terms)
table_fit_gam$s.pv
## [1] 0.1002523 0.4254312
plot(fit_gam)
hist(fit_gam$residuals)
# Remove weight
fit_gam_reduced_1 <- gam(log(surv.time)~s(prod,bs = "cr")+location,data = bee)
anova(fit_gam_reduced_1,fit_gam, test = "F")
# Remove prod
fit_gam_reduced_2 <- gam(log(surv.time)~location,data = bee)
anova(fit_gam_reduced_2, fit_gam_reduced_1, test = "F")
We can use a simple linear model with just one term, location: no need for the gam machinery anymore.
fit_lm <- lm(log(surv.time)~location,data = bee)
# Avg log surv time in Milan
sum(fit_lm$coefficients)
## [1] 1.168164
# mean(log(bee$surv.time[bee$location=="Milan"])) # alternatively
# Avg log surv time in Jovencan
fit_lm$coefficients[1]
## (Intercept)
## 3.010891
# mean(log(bee$surv.time[bee$location=="Jovencan"])) # alternatively
A more concise way, using tapply and functional programming:
tapply(log(bee$surv.time), bee$location, mean)
## Jovencan Milan
## 3.010891 1.168164